scientific publication
CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications.
Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction
Rawat, Shivam, Flek, Lucie, Karimi, Akbar
Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions from textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for astronomy corpora classification. To carry out the fine-tuning, we stochastically sample segments from the training data and use majority voting over the test segments at inference time. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of Change Detection
Neptune, Nathalie, Mothe, Josiane
The Amazon rain forest is a vital ecosystem that plays a crucial role in regulating the Earth's climate and providing habitat for countless species. Deforestation in the Amazon is a major concern as it has a significant impact on global carbon emissions and biodiversity. In this paper, we present a method for detecting deforestation in the Amazon using image pairs from Earth observation satellites. Our method leverages deep learning techniques to compare the images of the same area at different dates and identify changes in the forest cover. We also propose a visual semantic model that automatically annotates the detected changes with relevant keywords. The candidate annotation for images are extracted from scientific documents related to the Amazon region. We evaluate our approach on a dataset of Amazon image pairs and demonstrate its effectiveness in detecting deforestation and generating relevant annotations. Our method provides a useful tool for monitoring and studying the impact of deforestation in the Amazon. While we focus on environment applications of our work by using images of deforestation in the Amazon rain forest to demonstrate the effectiveness of our proposed approach, it is generic enough to be applied to other domains.
From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models
Ascione, Grazia Sveva, Tamagnone, Nicolò
Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines including transformer-based models, and zero-shot LLM; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.
Towards Large Language Models for Lunar Mission Planning and In Situ Resource Utilization
Pekala, Michael, Canal, Gregory, Barham, Samuel, Graziano, Milena B., Trexler, Morgan, Hamilton, Leslie, Reilly, Elizabeth, Stiles, Christopher D.
A key factor for lunar mission planning is the ability to assess the local availability of raw materials. However, many potentially relevant measurements are scattered across a variety of scientific publications. In this paper we consider the viability of obtaining lunar composition data by leveraging LLMs to rapidly process a corpus of scientific publications. While leveraging LLMs to obtain knowledge from scientific documents is not new, this particular application presents interesting challenges due to the heterogeneity of lunar samples and the nuances involved in their characterization. Accuracy and uncertainty quantification are particularly crucial since many materials properties can be sensitive to small variations in composition. Our findings indicate that off-the-shelf LLMs are generally effective at extracting data from tables commonly found in these documents. However, there remains opportunity to further refine the data we extract in this initial approach; in particular, to capture fine-grained mineralogy information and to improve performance on more subtle/complex pieces of information.
Delving into the Utilisation of ChatGPT in Scientific Publications in Astronomy
Astarita, Simone, Kruk, Sandor, Reerink, Jan, Gómez, Pablo
Rapid progress in the capabilities of machine learning approaches in natural language processing has culminated in the rise of large language models over the last two years. Recent works have shown unprecedented adoption of these for academic writing, especially in some fields, but their pervasiveness in astronomy has not been studied sufficiently. To remedy this, we extract words that ChatGPT uses more often than humans when generating academic text and search a total of 1 million articles for them. This way, we assess the frequency of word occurrence in published works in astronomy tracked by the NASA Astrophysics Data System since 2000. We then perform a statistical analysis of the occurrences. We identify a list of words favoured by ChatGPT and find a statistically significant increase for these words against a control group in 2024, which matches the trend in other disciplines. These results suggest a widespread adoption of these models in the writing of astronomy papers. We encourage organisations, publishers, and researchers to work together to identify ethical and pragmatic guidelines to maximise the benefits of these systems while maintaining scientific rigour.
Unleashing the Power of AI. A Systematic Review of Cutting-Edge Techniques in AI-Enhanced Scientometrics, Webometrics, and Bibliometrics
Saeidnia, Hamid Reza, Hosseini, Elaheh, Abdoli, Shadi, Ausloos, Marcel
Purpose: The study aims to analyze the synergy of Artificial Intelligence (AI), with scientometrics, webometrics, and bibliometrics to unlock and to emphasize the potential of the applications and benefits of AI algorithms in these fields. Design/methodology/approach: By conducting a systematic literature review, our aim is to explore the potential of AI in revolutionizing the methods used to measure and analyze scholarly communication, identify emerging research trends, and evaluate the impact of scientific publications. To achieve this, we implemented a comprehensive search strategy across reputable databases such as ProQuest, IEEE Explore, EBSCO, Web of Science, and Scopus. Our search encompassed articles published from January 1, 2000, to September 2022, resulting in a thorough review of 61 relevant articles. Findings: (i) Regarding scientometrics, the application of AI yields various distinct advantages, such as conducting analyses of publications, citations, research impact prediction, collaboration, research trend analysis, and knowledge mapping, in a more objective and reliable framework. (ii) In terms of webometrics, AI algorithms are able to enhance web crawling and data collection, web link analysis, web content analysis, social media analysis, web impact analysis, and recommender systems. (iii) Moreover, automation of data collection, analysis of citations, disambiguation of authors, analysis of co-authorship networks, assessment of research impact, text mining, and recommender systems are considered as the potential of AI integration in the field of bibliometrics. Originality/value: This study covers the particularly new benefits and potential of AI-enhanced scientometrics, webometrics, and bibliometrics to highlight the significant prospects of the synergy of this integration through AI.
Logic Mill -- A Knowledge Navigation System
Erhardt, Sebastian, Ghosh, Mainak, Buunk, Erik, Rose, Michael E., Harhoff, Dietmar
Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora. It uses advanced Natural Language Processing (NLP) techniques to generate numerical representations of documents. Currently it leverages a large pre-trained language model to generate these document representations. The system focuses on scientific publications and patent documents and contains more than 200 million documents. It is easily accessible via a simple Application Programming Interface (API) or via a web interface. Moreover, it is continuously being updated and can be extended to text corpora from other domains. We see this system as a general-purpose tool for future research applications in the social sciences and other domains.
Scientific Fact-Checking: A Survey of Resources and Approaches
Vladika, Juraj, Matthes, Florian
The task of fact-checking deals with assessing the veracity of factual claims based on credible evidence and background knowledge. In particular, scientific fact-checking is the variation of the task concerned with verifying claims rooted in scientific knowledge. This task has received significant attention due to the growing importance of scientific and health discussions on online platforms. Automated scientific fact-checking methods based on NLP can help combat the spread of misinformation, assist researchers in knowledge discovery, and help individuals understand new scientific breakthroughs. In this paper, we present a comprehensive survey of existing research in this emerging field and its related tasks. We provide a task description, discuss the construction process of existing datasets, and analyze proposed models and approaches. Based on our findings, we identify intriguing challenges and outline potential future directions to advance the field.